taken from: https://arxiv.org/pdf/1706.03762.pdf
Important things to note are:
Why Self-Attention?
Self-attention allows the model to weigh, for each word, the importance of every other word in the sequence:
taken from http://lucasb.eyer.be/transformer
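A minimal NumPy sketch of scaled dot-product self-attention as defined in the paper above; the dimensions and random projection weights are illustrative, not from the source:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention (Vaswani et al., 2017).

    x: (seq_len, d_model) token embeddings.
    Each output position is a weighted sum over all value vectors.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # weighted importance

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 tokens, model dim 8
w = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (4, 8)
```

Multi-head attention (next section) simply runs several such attention functions in parallel on lower-dimensional projections and concatenates the results.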
Multi-Head Self-Attention
Transformer models do not see single characters, but tokens:
taken from https://platform.openai.com/tokenizer
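A toy sketch of the greedy merge loop behind byte-pair encoding; the merge table here is hypothetical, real tokenizers learn tens of thousands of merges from data:

```python
def bpe_tokenize(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority (lowest-rank)
    merge of two adjacent tokens until no learned merge applies."""
    tokens = list(word)
    while True:
        candidates = [(merges[pair], i)
                      for i, pair in enumerate(zip(tokens, tokens[1:]))
                      if pair in merges]
        if not candidates:
            return tokens
        _, i = min(candidates)                   # best-ranked merge wins
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# hypothetical merge table: lower rank = learned earlier = higher priority
merges = {("t", "h"): 0, ("th", "e"): 1, ("e", "r"): 2}
print(bpe_tokenize("ther", merges))  # ['the', 'r']
```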
Architecture of some important Transformer models
taken from Lukas Beyer again
taken from Chip Huyen
Colossal Cleaned Crawled Corpus (C4) This is 800GB of cleaned Common Crawl web data. https://github.com/google-research/text-to-text-transfer-transformer#c4
BookCorpus "The books have been crawled from https://www.smashwords.com, see their terms of service for more information."
Stack-Exchange preferences
Instruction Data-Sets
taken from Lilian Weng
Data-Sets "stolen" from ChatGPT
The project can be found here
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) Training
Lineage of ChatGPT
taken from: How does GPT Obtain its Ability?
Distilling ChatGPT: "Our data generation process results in 52K unique instructions and the corresponding outputs, which costed less than $500 using the OpenAI API."
Here is a link to the 'open source' models and their performance.
LoRA (Low-Rank Adaptation)
This is the corresponding paper.
The pretrained weight matrix $\mathbf{W}$ is frozen during training. An additional weight update is learned through the two low-rank (rank $r$) matrices $\mathbf{A}$ and $\mathbf{B}$. Only these weights (orange) are updated. The input vector (dark blue) is multiplied with the frozen weights as well as with the low-rank adaptation of the weight matrix. The results are simply added.
During training, only the gradients for the orange matrices have to be kept in GPU memory.
taken from Sebastian Raschka
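The forward pass described above can be sketched in a few lines of NumPy; the dimensions and the rank are illustrative:

```python
import numpy as np

d, r = 512, 8                        # model dim, LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # pretrained weights, frozen
A = rng.normal(size=(r, d)) * 0.01   # low-rank factor, trainable
B = np.zeros((d, r))                 # init to zero, so the delta starts at 0
x = rng.normal(size=d)               # input vector

h = W @ x + B @ (A @ x)              # frozen path + low-rank path, added

# trainable parameters: 2*d*r instead of d*d
print(d * d, 2 * d * r)  # 262144 8192
```

With $r = 8$ and $d = 512$, the trainable parameter count drops from 262,144 to 8,192, which is why only a small fraction of the gradients must be kept in GPU memory.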
Bits and Bytes
Tim Dettmers et al., 2022
QLora by Dettmers et al., 2023
In a few words, QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 48GB GPU. See here.
illustration taken from here
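A back-of-the-envelope calculation makes the memory claim plausible; this counts only the weight storage and ignores activations, optimizer states, and the quantization constants QLoRA also stores:

```python
params = 65e9                     # 65B-parameter model

def to_gb(n_bytes):
    return n_bytes / 2**30

fp16 = to_gb(params * 2.0)        # 16-bit weights: 2 bytes/param
nf4 = to_gb(params * 0.5)         # 4-bit NormalFloat: 0.5 bytes/param

print(round(fp16), round(nf4))    # 121 30
```

At 16 bits the weights alone (~121 GB) exceed any single GPU, while at 4 bits (~30 GB) they fit on a 48GB card with room left for the LoRA adapters and activations.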
Prompt engineering will probably not remain a job.
From Mishra et al., 2022:
Prompting works by imitating the training data as closely as possible.
Fiction
Since most models are also trained on one or several book corpora, they can be prompted to take on a fictional persona.
Zero-Shot CoT prompting
Let's think step by step
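Zero-shot CoT needs no hand-written reasoning examples: the trigger phrase is simply appended to the question (Kojima et al., 2022). A minimal sketch; the question text is a made-up example:

```python
def zero_shot_cot(question):
    """Build a zero-shot chain-of-thought prompt by appending the
    trigger phrase 'Let's think step by step.' to the question."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("A bat and a ball cost $1.10 in total. "
                       "The bat costs $1.00 more than the ball. "
                       "How much does the ball cost?")
print(prompt)
```

The model then continues the text after the trigger phrase, producing intermediate reasoning steps before the final answer.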
Remember Byte-Pair-Encoding:
For more funny examples with "O", see this Twitter feed.
Why Large Language Models cannot calculate with large numbers:
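The root cause is byte-pair encoding: a long number is not one token but an arbitrary sequence of digit chunks, so the model never sees a consistent digit-by-digit representation. A hypothetical chunking (real BPE splits vary per vocabulary) illustrates the problem:

```python
def chunk_number(digits, size=3):
    """Hypothetical BPE-style split of a digit string into fixed-size
    chunks. Note how appending one digit reshuffles the final token,
    so superficially similar numbers get dissimilar token sequences."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

print(chunk_number("123456789"))  # ['123', '456', '789']
print(chunk_number("12345678"))   # ['123', '456', '78']
```

Because the tokens do not align with decimal place values, carrying and digit-wise arithmetic have to be learned as opaque pattern completion rather than as a positional algorithm.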
Image('../images/rip_prompt_engineer.png')